The Mirage of Action-Dependent Baselines in Reinforcement Learning

Authors

George Tucker

Surya Bhupatiraju

Shixiang Gu

Richard E. Turner

Zoubin Ghahramani

Sergey Levine

Abstract

Policy gradient methods are a widely used class of model-free reinforcement learning algorithms where a state-dependent baseline is used to reduce gradient estimator variance. Several recent papers extend the baseline to depend on both the state and action and suggest that this significantly reduces variance and improves sample efficiency without introducing bias into the gradient estimates. To better understand this development, we decompose the variance of the policy gradient estimator and numerically show that learned state-action-dependent baselines do not in fact reduce variance over a state-dependent baseline in commonly tested benchmark domains. We confirm this unexpected result by reviewing the open-source code accompanying these prior papers, and show that subtle implementation decisions cause deviations from the methods presented in the papers and explain the source of the previously observed empirical gains. Furthermore, the variance decomposition highlights areas for improvement, which we demonstrate by illustrating a simple change to the typical value function parameterization that can significantly improve performance.
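To make the role of a state-dependent baseline concrete, the following is a minimal sketch, not the authors' code or experimental setup. It assumes a toy single-state problem with a softmax policy over three actions; the function name `grad_estimator_stats` and the logits and rewards are hypothetical values chosen for illustration. It computes, by exact enumeration rather than sampling, the mean and total variance of the score-function estimator (r(a) - b) ∇ log π(a), with and without a baseline b.

```python
import numpy as np


def softmax(logits):
    # Numerically stable softmax over a 1-D array of logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()


def grad_estimator_stats(logits, rewards, baseline):
    """Exact mean and total variance of (r(a) - b) * grad log pi(a).

    Assumes a single state and a softmax policy parameterized directly
    by its logits (an illustrative toy setting).
    """
    pi = softmax(logits)
    n = len(pi)
    # For a softmax policy, d log pi(a) / d logits = one_hot(a) - pi.
    # Row a of `scores` is the score vector for action a.
    scores = np.eye(n) - pi
    # One estimator value per action (row), weighted by pi below.
    samples = (rewards - baseline)[:, None] * scores
    mean = pi @ samples
    second_moment = pi @ (samples ** 2)
    total_var = float(np.sum(second_moment - mean ** 2))
    return mean, total_var


# Hypothetical toy values, chosen only for illustration.
logits = np.array([0.5, -0.2, 0.1])
rewards = np.array([10.0, 11.0, 12.0])

# No baseline.
mean_0, var_0 = grad_estimator_stats(logits, rewards, 0.0)

# Baseline equal to the expected reward under the policy
# (the analogue of a state-value baseline in this one-state setting).
value = float(softmax(logits) @ rewards)
mean_b, var_b = grad_estimator_stats(logits, rewards, value)

print("same mean:", np.allclose(mean_0, mean_b))
print("variance without baseline:", var_0)
print("variance with baseline:", var_b)
```

In this toy setting the two estimators have the same mean, because the expected score is zero, while the baseline-subtracted version has lower total variance. The sketch illustrates only the state-dependent case; it makes no claim about the action-dependent baselines or the benchmark domains studied in the paper.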

Similar resources

Action-dependent Factorized Baselines

Policy gradient methods have enjoyed great success in deep reinforcement learning but suffer from high variance of gradient estimates. The high variance problem is particularly exacerbated in problems with long horizons or high-dimensional action spaces. To mitigate this issue, we derive a bias-free action-dependent baseline for variance reduction which fully exploits the structural form of the...

Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines

Action-dependent Factorized Baselines

Reinforcement learning based feedback control of tumor growth by limiting maximum chemo-drug dose using fuzzy logic

In this paper, a model-free reinforcement learning-based controller is designed to extract a treatment protocol because the design of a model-based controller is complex due to the highly nonlinear dynamics of cancer. The Q-learning algorithm is used to develop an optimal controller for cancer chemotherapy drug dosing. In the Q-learning algorithm, each entry of the Q-table is updated using data...

Policy Gradient Methods for Reinforcement Learning with Function Approximation and Action-Dependent Baselines

1 Notation and Background We assume that the reader is familiar with the seminal paper by Sutton et al. (2000), which shows how the policy gradient theorem (Baxter and Bartlett, 1999) can be extended to include function approximation and action-independent baselines. Our paper is intended to be read immediately after reviewing Section 3 of the paper by Sutton et al. (2000). Although here we ado...

Journal: CoRR

Volume: abs/1802.10031

Publication date: 2018
